Improving the Performance of Text Categorization using N-gram Kernels
Authors
Abstract
Kernel methods are known for their robustness in handling large feature spaces and are widely used as an alternative to methods based on external feature extraction in tasks such as classification and regression. This work applies string kernels, such as n-gram kernels and gappy n-gram kernels, to text classification. It studies how kernel concatenation and feature combination affect the classification accuracy of the system, and it explores how kernel combination algorithms perform on the system. The kernels are implemented as rational kernels, which satisfy Mercer's theorem, ensuring that the kernel matrices are positive definite symmetric. The rational kernels are computed with a general algorithm for the composition of weighted transducers, which helps in dealing with variable-length sequences. These kernels are then used with an SVM to form an efficient classifier for text categorization. Both one-stage and two-stage algorithms are applied for kernel combination, and both achieved better system performance than the individual kernels.
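As a rough illustration of the n-gram kernel idea in the abstract, the following minimal sketch computes the kernel as an inner product of explicit n-gram count vectors and combines two orders with fixed weights. It rests on several assumptions that are not from the paper: it uses character n-grams, it does not use weighted transducer composition, and the uniform weights stand in for the one-stage and two-stage combination algorithms, which learn the weights. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n):
    """Count every contiguous character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_kernel(x, y, n):
    """Unnormalized n-gram kernel: inner product of the two count vectors."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    # A Counter returns 0 for missing keys, so unshared n-grams add nothing.
    return sum(count * cy[gram] for gram, count in cx.items())


def normalized_kernel(x, y, n):
    """Cosine-normalized kernel so that K(x, x) = 1 for non-trivial x."""
    kxx, kyy = ngram_kernel(x, x, n), ngram_kernel(y, y, n)
    if kxx == 0 or kyy == 0:
        return 0.0  # text shorter than n has no n-grams
    return ngram_kernel(x, y, n) / sqrt(kxx * kyy)


def combined_kernel(x, y, orders=(2, 3), weights=None):
    """Weighted sum of normalized kernels (uniform weights are an assumption)."""
    if weights is None:
        weights = [1.0 / len(orders)] * len(orders)
    return sum(w * normalized_kernel(x, y, n) for w, n in zip(weights, orders))


def gram_matrix(docs, kernel):
    """Pairwise kernel values for a list of documents."""
    return [[kernel(a, b) for b in docs] for a in docs]


docs = ["stock prices fell", "stock prices rose", "the team won the match"]
K = gram_matrix(docs, combined_kernel)
```

A non-negative weighted sum of positive definite symmetric kernels is itself positive definite symmetric, so a matrix such as `K` can be handed to any SVM implementation that accepts a precomputed kernel matrix.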
Similar resources
Can characters reveal your native language? A language-independent approach to native language identification
A common approach in text mining tasks such as text categorization, authorship identification or plagiarism detection is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In this work, an approach that uses character n-grams as features is proposed for the task of native language identification. Instead of doing standard feature selection,...
A Study Using n-gram Features for Text Categorization
In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or...
Text Categorization Techniques for Intrusion Detection -- A N-Gram-Based Method
Text categorization techniques were used for anomaly intrusion detection by Liao and Vemuri in their USENIX '02 paper [1]. Another n-gram-based text categorization method, proposed in this report, is expected to improve the performance of an intrusion detection system that implements Liao's method.
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
A Comparison of Text-Categorization Methods Applied to N-Gram Frequency Statistics
This paper gives an analysis of multi-class e-mail categorization performance, comparing a character n-gram document representation against a word-frequency-based representation. Furthermore, the impact of using available e-mail-specific meta-information on classification performance is explored and the findings are presented.
Publication date: 2015